feat: DashboardHygieneAnalyzer (broken panels) #23
Merged
…panel/dashboard-hygiene Scope narrows in v0.1 to only broken-panel detection; the old names had no live emitters.
Used by DashboardHygieneAnalyzer to carry the dashboard title. omitempty keeps the wire form clean for non-dashboard findings.
Anchored regex against Finding.Dashboard. Empty field never matches. Wires through config.IgnoreConfig.Dashboard.
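The matching rule in this commit can be sketched as follows. This is a minimal illustration, not the project's code: the `Finding` struct is a trimmed-down stand-in, and `compileAnchored` and `ignoredByDashboard` are hypothetical names. It assumes "anchored" means the pattern must match the whole dashboard title.

```go
package main

import (
	"fmt"
	"regexp"
)

// Finding is a hypothetical, reduced stand-in for the real finding type.
type Finding struct {
	Metric    string
	Dashboard string
}

// compileAnchored wraps the user pattern so it has to match the entire title.
func compileAnchored(pattern string) (*regexp.Regexp, error) {
	return regexp.Compile(`^(?:` + pattern + `)$`)
}

// ignoredByDashboard reports whether f should be dropped by the pattern.
// An empty Dashboard field never matches, even for patterns such as `.*`.
func ignoredByDashboard(re *regexp.Regexp, f Finding) bool {
	if re == nil || f.Dashboard == "" {
		return false
	}
	return re.MatchString(f.Dashboard)
}

func main() {
	re, err := compileAnchored(`Team A.*`)
	if err != nil {
		panic(err)
	}
	fmt.Println(ignoredByDashboard(re, Finding{Dashboard: "Team A / API"}))  // true
	fmt.Println(ignoredByDashboard(re, Finding{Dashboard: "Legacy Team A"})) // false: anchored at the start

	all, err := compileAnchored(`.*`)
	if err != nil {
		panic(err)
	}
	fmt.Println(ignoredByDashboard(all, Finding{Metric: "up"})) // false: empty field never matches
}
```

The explicit empty-string guard matters because a pattern like `.*` would otherwise match findings that carry no dashboard at all.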
PanelTargets walks rows recursively, returns (panel-title, expr) pairs filtered to Prometheus targets. BaseURL is a defensive copy used by the dashboard-hygiene analyzer to build absolute dashboard URLs in Fix snippets.
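A recursive walk of this kind might look roughly like the sketch below. The `Panel`, `Target`, and `PanelTarget` types are simplified assumptions about the dashboard model, not the client's real types, and the `"prometheus"` datasource type string is assumed to be how Prometheus targets are identified.

```go
package main

import "fmt"

// Target is a hypothetical, simplified panel query target.
type Target struct {
	DatasourceType string
	Expr           string
}

// Panel is a hypothetical panel; a row carries its children in Panels.
type Panel struct {
	Title   string
	Targets []Target
	Panels  []Panel
}

// PanelTarget is one flat (panel-title, expr) pair.
type PanelTarget struct {
	Panel string
	Expr  string
}

// panelTargets flattens panels and nested rows into (title, expr) pairs,
// keeping only Prometheus targets with a non-empty expression.
func panelTargets(panels []Panel) []PanelTarget {
	var out []PanelTarget
	for _, p := range panels {
		for _, t := range p.Targets {
			if t.DatasourceType != "prometheus" || t.Expr == "" {
				continue
			}
			out = append(out, PanelTarget{Panel: p.Title, Expr: t.Expr})
		}
		// Recurse into rows (and rows nested inside rows).
		out = append(out, panelTargets(p.Panels)...)
	}
	return out
}

func main() {
	dash := []Panel{
		{Title: "Row", Panels: []Panel{
			{Title: "CPU", Targets: []Target{
				{DatasourceType: "prometheus", Expr: "rate(cpu_seconds_total[5m])"},
				{DatasourceType: "loki", Expr: `{app="api"}`},
			}},
		}},
	}
	for _, pt := range panelTargets(dash) {
		fmt.Printf("%s: %s\n", pt.Panel, pt.Expr)
	}
}
```

Filtering by datasource type at this level is what lets the analyzer skip Loki targets without emitting findings or warnings.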
New analyzer flags Grafana dashboards whose panel queries reference missing Prometheus metrics. Skeleton + nil-Graf warning path; full algorithm in subsequent commits.
Walks every Grafana dashboard, parses Prometheus targets via promqlx, and groups (dashboard, missing-metric) pairs. Severity Medium per the design. Recording-rule outputs and fix snippet come in later commits.
…rename
Code-quality follow-ups to the happy-path commit:
- add Dashboard tiebreaker to the comparator so output is deterministic across map iterations
- extract the comparator into findingLess to keep Analyze under the gocyclo limit (was 15, the extra branch pushed it to 16)
- extract sample-cap 5 to a named const for grep-ability
- rename buildFinding parameter to mirror the call-site variable name
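A comparator with a full chain of tiebreakers could be sketched as below. It follows the ordering this PR states elsewhere (severity desc, sample-count desc, dashboard asc, metric asc); the `Finding` fields and the `sortFindings` helper are hypothetical, and only the `findingLess` name comes from the commit message.

```go
package main

import (
	"fmt"
	"sort"
)

// Finding is a hypothetical stand-in with just the fields the ordering needs.
type Finding struct {
	Severity    int // higher means more severe (assumption)
	SampleCount int // number of affected panel samples (assumption)
	Dashboard   string
	Metric      string
}

// findingLess orders by severity desc, sample count desc, dashboard asc,
// metric asc. Every field takes part, so the order is total and does not
// depend on map iteration order upstream.
func findingLess(a, b Finding) bool {
	if a.Severity != b.Severity {
		return a.Severity > b.Severity
	}
	if a.SampleCount != b.SampleCount {
		return a.SampleCount > b.SampleCount
	}
	if a.Dashboard != b.Dashboard {
		return a.Dashboard < b.Dashboard
	}
	return a.Metric < b.Metric
}

// sortFindings sorts in place using findingLess.
func sortFindings(fs []Finding) {
	sort.Slice(fs, func(i, j int) bool { return findingLess(fs[i], fs[j]) })
}

func main() {
	fs := []Finding{
		{Severity: 2, SampleCount: 1, Dashboard: "b", Metric: "m1"},
		{Severity: 2, SampleCount: 3, Dashboard: "a", Metric: "m2"},
		{Severity: 2, SampleCount: 3, Dashboard: "a", Metric: "m1"},
		{Severity: 3, SampleCount: 1, Dashboard: "z", Metric: "m9"},
	}
	sortFindings(fs)
	for _, f := range fs {
		fmt.Println(f.Severity, f.SampleCount, f.Dashboard, f.Metric)
	}
}
```

Pulling the comparison into its own function also keeps the branch count out of the caller, which is the gocyclo motivation given above.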
Same missing metric across multiple panels yields one finding; distinct missing metrics in the same dashboard emit separate findings.
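The grouping behaviour described here amounts to keying on the (dashboard, metric) pair. A minimal sketch, with hypothetical `ref` and `key` types and a plain map standing in for the set of known metrics:

```go
package main

import "fmt"

// ref is one metric reference extracted from a panel query (hypothetical).
type ref struct {
	Dashboard string
	Panel     string
	Metric    string
}

// key identifies one finding: a missing metric within one dashboard.
type key struct {
	Dashboard string
	Metric    string
}

// groupMissing collects the panels that reference each missing metric,
// grouped per (dashboard, metric). Metrics present in exists are skipped.
func groupMissing(refs []ref, exists map[string]bool) map[key][]string {
	out := make(map[key][]string)
	for _, r := range refs {
		if exists[r.Metric] {
			continue
		}
		k := key{Dashboard: r.Dashboard, Metric: r.Metric}
		out[k] = append(out[k], r.Panel)
	}
	return out
}

func main() {
	refs := []ref{
		{"API", "Latency", "http_latency"},
		{"API", "Errors", "http_latency"},
		{"API", "Errors", "http_errors"},
		{"API", "Uptime", "up"},
	}
	groups := groupMissing(refs, map[string]bool{"up": true})
	fmt.Println(len(groups))                                // 2
	fmt.Println(len(groups[key{"API", "http_latency"}]))    // 2
}
```

Two panels sharing `http_latency` collapse into one group, while `http_errors` in the same dashboard gets its own.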
A recording rule whose output is not yet in head series must still be treated as a known metric. Mirrors the resolution flow in unusedmetrics, including the VictoriaMetrics graceful-degrade sentinel.
… test
Code-quality follow-ups to the recording-rule resolution commit:
- expand the two BuildInfo-adjacent comments to explain WHY each VM-flavor check exists (404 path vs 200-empty-groups path), so a future reader doesn't need to cross-reference unusedmetrics
- strengthen the RR test by querying AlertA from a second panel: proves the type filter (r.Type == "recording") actually held, not just that the recording-rule output was added to exists
Grafana template variables like ${metric}_total sanitise to
__remetric_var___total - a valid PromQL identifier that parses
cleanly and leaks into the extracted metric set. Change isSentinel
to a Contains check so any name containing the sentinel substring
is treated as a sanitiser artifact and filtered.
Without this, dashboardhygiene and unusedmetrics would treat
template-variable expressions as references to bogus metrics
named '__remetric_var__*'.
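The difference between the equality check and the substring check can be shown in a few lines. This is a sketch only: the sentinel value `__remetric_var__` is inferred from the example in the commit message, and `dropSentinels` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
)

// sentinel is the placeholder assumed to replace a Grafana template
// variable during sanitisation (value inferred from the commit message).
const sentinel = "__remetric_var__"

// isSentinel treats any name containing the sentinel as a sanitiser
// artifact, so concatenations such as "__remetric_var___total" are caught.
func isSentinel(name string) bool {
	return strings.Contains(name, sentinel)
}

// dropSentinels returns the names that are not sanitiser artifacts.
func dropSentinels(names []string) []string {
	var out []string
	for _, n := range names {
		if !isSentinel(n) {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	names := []string{"__remetric_var__", "__remetric_var___total", "http_requests_total"}
	fmt.Println(dropSentinels(names)) // [http_requests_total]
}
```

With an equality check, only the first entry would have been filtered and `__remetric_var___total` would have been reported as a missing metric.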
…and Loki targets
Grafana template-variable queries (${metric}_total) and non-Prometheus
datasources must not generate findings or warnings. The
template-variable path relies on promqlx filtering sentinel-derived
metric names; the Loki path relies on PanelTargets filtering by
datasource type.
Per-dashboard fetch errors degrade to warnings without aborting the analyzer. Search() failure is fatal. VictoriaMetrics without --vmalert emits the recording-rules-unavailable warning.
Renders a paste-ready instruction block: restore the metric or remove the broken queries. Drops the URL line when no absolute dashboard URL is available. Caps the panel list at 10 entries with a '... and N more' tail.
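The capping and URL-dropping rules can be illustrated with a small renderer. The wording of the block, the `renderFix` name, and its parameters are all hypothetical; only the cap of 10 and the '... and N more' tail come from the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

// maxPanels is the number of panel titles listed before truncating.
const maxPanels = 10

// renderFix builds an illustrative instruction block. The URL line is
// omitted when dashboardURL is empty, and the panel list is capped at
// maxPanels with a "... and N more" tail.
func renderFix(metric, dashboardURL string, panels []string) string {
	var b strings.Builder
	fmt.Fprintf(&b, "Missing metric: %s\n", metric)
	if dashboardURL != "" {
		fmt.Fprintf(&b, "Dashboard: %s\n", dashboardURL)
	}
	b.WriteString("Panels:\n")

	shown := len(panels)
	if shown > maxPanels {
		shown = maxPanels
	}
	for _, p := range panels[:shown] {
		fmt.Fprintf(&b, "  - %s\n", p)
	}
	if rest := len(panels) - shown; rest > 0 {
		fmt.Fprintf(&b, "  ... and %d more\n", rest)
	}
	return b.String()
}

func main() {
	panels := make([]string, 12)
	for i := range panels {
		panels[i] = fmt.Sprintf("Panel %d", i+1)
	}
	fmt.Print(renderFix("http_latency", "", panels))
}
```

Run as written, the output lists ten panels followed by a tail line for the remaining two, with no URL line because none was supplied.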
… builtin
Code-quality follow-ups to the fix-snippet commit:
- pull the broken-panel docs URL from findings.DocURL(ClassBrokenPanel) instead of a hardcoded literal, so the single source of truth in internal/findings/ stays authoritative
- replace explicit limit-clamping with the Go 1.21+ min() builtin
- replace C-style index loop with idiomatic range over the slice
New top-level subject 'dashboards' with one action 'broken'. Requires --prometheus and --grafana. Honors --output, --min-severity, --ignore-dashboard, --ignore-metric, --fail-on, --limit, --timeout.
…pty.go
Code-quality follow-ups to the dashboards broken subcommand:
- drop the CLI's local re-sort; the analyzer already orders by (severity desc, sample-count desc, dashboard asc, metric asc), and the filter passes are stable. The local sort was discarding the sample-count tiebreaker, a meaningful signal that broken-from-many-panels metrics rank higher within a severity tier
- move brokenPanelCopy to empty.go for parity with cardinalityCopy and labelPatternCopy
- extend TestEmptyCopy_Values to cover brokenPanelCopy and the previously-uncovered unusedMetricsCopy
Both flows include the new analyzer; without --grafana it emits a warning and zero findings (consistent with unusedmetrics).
…l page Real content for the broken-panel finding class. Updates the catalog, the unused-metric cross-link, the mkdocs nav, and the 'What's still missing in v0.1' README section since the analyzer now ships in v0.1.
Provisions a Grafana dashboard whose only panel queries a metric Prometheus does not scrape; runs 'remetric dashboards broken' and asserts the finding is emitted with class=broken-panel.
…text Follow-up to the analyzer landing. Updates user-facing copy that still listed the v0.1 analyzer set as four (now five with broken-panel) and omitted --ignore-dashboard from the ignore-* table.
Summary
Adds the last v0.1 analyzer: flags Grafana dashboards whose panel queries reference Prometheus metrics that do not exist (not in head series, not in recording-rule outputs). One finding per (dashboard, missing-metric) pair, severity Medium.

Scope narrowed from spec §6.3: ships only broken-panel detection in v0.1. Untouched-dashboard detection (weak proxy without Grafana Enterprise meta.viewedAt) and near-duplicate detection (no canonical "panel signature" definition) are deferred per .claude/docs/superpowers/specs/2026-05-23-dashboard-sprawl-analyzer-design.md §11.

What's new
- internal/analyzers/dashboardhygiene/ with happy-path detection, recording-rule resolution, VM-without-vmalert graceful-degrade, silent-skip for template-variable + non-prom datasources, fix-snippet builder.
- remetric dashboards broken --prometheus <URL> --grafana <URL> (both flags required). Honors all standard flags including new --ignore-dashboard <regex>.
- scan and report runner slices (no-Grafana → warning, parallel to unusedmetrics).
- Finding.Dashboard string field with omitempty.
- ignore.Patterns.Dashboard + matching --ignore-dashboard flag.
- ClassDashboardSprawl → ClassBrokenPanel, CategoryDashboardSprawl → CategoryDashboardHygiene (no live emitters of the old names).
- Dashboard.PanelTargets() (flat panel-title + expr pairs) and Client.BaseURL() (defensive copy).
- isSentinel switched from equality to substring containment. Catches concatenations like ${metric}_total → __remetric_var___total that previously leaked into the extracted metric set, polluting findings in both dashboardhygiene and unusedmetrics.
- docs/findings/broken-panel.md replaces the dashboard-sprawl.md placeholder; mkdocs nav + cross-link in unused-metric.md + README + --help text all updated.
- e2e/dashboards_e2e_test.go provisions a broken-panel dashboard via file-based Grafana provisioning, asserts the finding.

Test plan
- go test ./... -count=1 -race (20 packages, all PASS)
- make fmt vet lint vuln (0 issues, no vulnerabilities)
- make cover (total 86.1%, dashboardhygiene 85.1%; exceeds 75% floor + 80% target)
- make e2e (all 8 e2e tests PASS including new TestE2E_DashboardsBroken_JSON)

Commits
21 commits with per-task two-stage review (spec compliance → code quality), each commit + fixup is independently buildable + tested. Squash-friendly history; bisect-friendly if anything regresses later.